CONOUBWN FE 


. Introduction to Concise Signal Models 

. Signal Dictionaries and Representations 
. Manifolds 

. Low-Dimensional Signal Models 

. Approximation 

. Compression 

. Dimensionality Reduction 

. Compressed Sensing 


Introduction to Concise Signal Models 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


Overview 


In characterizing a given problem in signal processing, one is often able to 
specify a model for the signals to be processed. This model may distinguish 
(either statistically or deterministically) classes of interesting signals from 
uninteresting ones, typical signals from anomalies, information from noise, 
etc. 


Very commonly, models in signal processing deal with some notion of 
structure, constraint, or conciseness. Roughly speaking, one often believes 
that a signal has “few degrees of freedom” relative to the size of the signal. 
This notion of conciseness is a very powerful assumption, and it suggests 
the potential for dramatic gains via algorithms that capture and exploit the 
true underlying structure of the signal. 


In these modules, we survey three common examples of concise models: 
linear models, sparse nonlinear models, and manifold-based models. In 
each case, we discuss an important phenomenon: the conciseness of the 
model corresponds to a low-dimensional geometric structure along which 
the signals of interest tend to cluster. This low-dimensional geometry again 
has important implications in the understanding and the development of 
efficient algorithms for signal processing. 


We discuss this low-dimensional geometry in several contexts, including 
projecting a signal onto the model class (i.e., forming a concise 
approximation to a signal), encoding such an approximation (i.e., data 
compression), and reducing the dimensionality of signals and data sets. We 
conclude with an important and emerging application area known as 
Compressed Sensing (CS), which is a novel method for data acquisition that 
relies on concise models and builds upon strong geometric principles. We 


discuss CS in its traditional, sparsity-based context and also discuss 
extensions of CS to other concise models such as manifolds. 


General Mathematical Preliminaries 


Signal notation 


We will treat signals as real- or complex-valued functions having domains 
that are either discrete (and finite) or continuous (and either compact or 
infinite). Each of these assumptions will be made clear as needed. As a 
general rule, however, we will use z to denote a discrete signal in R™ and f 
to denote a function over a continuous domain Y. We also commonly refer 
to these as discrete- or continuous-time signals, though the domain need not 
actually be temporal in nature. 


Lp and Ip norms 


As measures for signal energy, fidelity, or sparsity, we will employ the L,, 
and £, norms. For continuous-time functions, the LZ, norm is defined as 


Equation: 
1/p 
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and for discrete-time functions, the £, norm is defined as 
Equation: 
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where __ denotes the indicator function. (While we often refer to these 
measures as “norms,” they actually do not meet the technical criteria for 
norms when p < 1.) 


Linear algebra 


Let A be areal-valued M x N matrix. We denote the nullspace of A as 
WN (A) (note that (A) is a linear subspace of R™), and we denote the 
transpose of A as A’. 


We call A an orthoprojector from R™ to R™ if it has orthonormal rows. 
From such a matrix we call A’ A the corresponding orthogonal projection 
operator onto the M-dimensional subspace of R™ spanned by the rows of 


A. 


Signal Dictionaries and Representations 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


For a wide variety of signal processing applications (including analysis, 
compression, noise removal, and so on) it is useful to consider the 
representation of a signal in terms of some dictionary [link]. In general, a 
dictionary W is simply a collection of elements drawn from the signal 
space whose linear combinations can be used to represent or approximate 
signals. 


Considering, for example, signals in R’, we may collect and represent the 
elements of the dictionary YW as an N x Z matrix, which we also denote as 
W. From this dictionary, a signal « € R% can be constructed as a linear 
combination of the elements (columns) of Y. We write 

Equation: 


zr—Wa 


for some a € R%. (For much of our notation in this section, we concentrate 
on signals in R, though the basic concepts translate to other vector 
spaces.) 


Dictionaries appear in a variety of settings. The most common may be the 
basis, in which case W has exactly N linearly independent columns, and 
each signal x has a unique set of expansion coefficients a = Y +z. The 
orthonormal basis (where the columns are normalized and orthogonal) is 
also of particular interest, as the unique set of expansion coefficients 

a = W~1!y = W' 2 can be obtained as the inner products of x against the 
columns of ¥. That is, a (4) = (xz, ¥;),i = 1,2,---,N, which gives us the 
expansion 

Equation: 


We also have that ||a||3 = S73", (a, vi)”. 


Frames are another special type of dictionary [link]. A dictionary VW is a 
frame if there exist numbers A and B, 0 < A < B < o such that, for any 
signal x 

Equation: 


Allz|l; < 5 (x, pz)” < Blla|l}. 


The elements of a frame may be linearly dependent in general (see [link]), 
and so there may exist many ways to express a particular signal among the 
dictionary elements. However, frames do have a useful analysis/synthesis 
duality: for any frame W there exists a dual frame W such that 

Equation: 


In the case where the frame vectors are represented as columns of the N x 
Z matrix W, the matrix Y containing the dual frame elements is simply the 
transpose of the pseudoinverse of W. A frame is called tight if the frame 
bounds A and B are equal. Tight frames have the special properties of (i) 
being their own dual frames (after a rescaling by 1/A) and (ii) preserving 
norms, i.e., 3+’, (x, Wi)” = Alla||3. The remainder of this section 
discusses several important dictionaries. 
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A simple, redundant frame Y 
containing three vectors that 
span R?. 


The canonical basis 


The standard basis for representing a signal is the canonical (or “spike”) 
basis. In R%, this corresponds to a dictionary ¥ = Iy (the N x N identity 
matrix). When expressed in the canonical basis, signals are often said to be 
in the “time domain.” 


Fourier dictionaries 


The frequency domain provides one alternative representation to the time 
domain. The Fourier series and discrete Fourier transform are obtained by 
letting Y contain complex exponentials and allowing the expansion 
coefficients a to be complex as well. (Such a dictionary can be used to 
represent real or complex signals.) A related “harmonic” transform to 
express signals in RY is the discrete cosine transform (DCT), in which ¥ 


contains real-valued, approximately sinusoidal functions and the 
coefficients @ are real-valued as well. 


Wavelets 


Closely related to the Fourier transform, wavelets provide a framework for 
localized harmonic analysis of a signal [link]. Elements of the discrete 
wavelet dictionary are local, oscillatory functions concentrated 
approximately on dyadic supports and appear at a discrete collection of 
scales, locations, and (if the signal dimension D > 1) orientations. 


Scale 


In wavelet analysis and other settings, we will frequently refer to a 
particular scale of analysis for a signal. Consider, for example, continuous- 


time functions f defined over the domain J = (0, 1]”. A dyadic 
hypercube X, C (0, ie at scale 7 € N is a domain that satisfies 
Equation: 


Xj = [612-7 (81 +1)27] x --- x [8p27, (Bp +1)2 7] 
with 61, B2,---,8pd € {0, 1,---,27- ge We call X; a dyadic interval 


when D = 1 ora dyadic square when D = 2 (see [link]). Note that X; has 
sidelength 2~J. 


Dyadic partitioning of the unit square at scales 
7 = 0,1, 2. The partitioning induces a coarse-to- 
fine parent/child relationship that can be modeled 

using a tree structure. 


For discrete-time functions the notion of scale is similar. We can imagine, 
for example, a “voxelization” of the domain [0, ie (“pixelization” when 
D = 2), where each voxel has sidelength 2-7, B € N, and it takes 22? 
voxels to fill [0, 1] > The relevant scales of analysis for such a signal would 
simply be 7 = 0,1,---, B, and each dyadic hypercube X ; would refer to a 
collection of voxels. 


Wavelet fundamentals 


The wavelet transform offers a multiscale decomposition of a function into 
a nested sequence of scaling spaces Vp C Vj C --- C Vj C - ++. Each 
scaling space V; is spanned by a discrete collection of dyadic translations of 
a lowpass scaling function @;, and the difference between adjacent scaling 
spaces V; and V;,1 is spanned by a discrete collection of dyadic translations 
of a bandpass wavelet function ~;. [link] shows an example of this 
multiscale organization in the case of the Haar wavelet dictionary. Each 
wavelet function at scale 7 is concentrated approximately on some dyadic 
hypercube X ;, and between scales, both the wavelets and scaling functions 


are “self-similar,” differing only by rescaling and dyadic dilation. When 

D > 1, the difference spaces are partitioned into 2? — 1 distinct 
orientations (when D = 2 these correspond to vertical, horizontal, and 
diagonal directions). The wavelet transform can be truncated at any scale 7. 
We then let the basis Y consist of all scaling functions at scale 7 plus all 
wavelets at scales 7 and finer. 


Multiscale wavelet representations on the interval [0, 1]. (a) Haar 
scaling functions spanning V; with 7 = 2. (b) Haar wavelet functions 
spanning the difference space between V; and V;+1. (c) Haar scaling 
functions spanning V;,1. (d) Two example functions belonging to the 

spaces (left) V; and (right) Vj+1. 
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Wavelets are essentially bandpass functions that detect abrupt changes in a 
signal. The scale of a wavelet, which controls its support both in time and in 
frequency, also controls its sensitivity to changes in the signal. This is made 
more precise by considering the wavelet analysis of smooth signals. 
Wavelet are often characterized by their number of vanishing moments; a 
wavelet basis function is said to have H vanishing moments if it is 
orthogonal to (its inner product is zero against) any H-degree polynomial. 
Sparse (Nonlinear) models discusses further the wavelet analysis of smooth 
and piecewise smooth signals. 


The dyadic organization of the wavelet transform lends itself to a 
multiscale, tree-structured organization of the wavelet coefficients. Each 
“parent” function, concentrated on a dyadic hypercube X; of sidelength 2~/ 
, has 2” “children” whose supports are concentrated on the dyadic 
subdivisions of X ;. This relationship can be represented in a top-down tree 
structure, as demonstrated in [link]. Because the parent and children share a 
location, they will presumably measure related phenomena about the signal, 
and so in general, any patterns in their wavelet coefficients tend to be 
reflected in the connectivity of the tree structure. [link] and [link] show an 
example of the wavelet transform applied to the Cameraman test image; 
since the dimension D = 2, each scale is partitioned into vertical, 
horizontal, and diagonal wavelet analysis, and each parent coefficient has 
2? — 4 children. 


50 100 150 200 250 


Cameraman test image (size 256 x 256) for use in 
wavelet decomposition and approximation 
examples. 


Wavelet analysis of the Cameraman test image. (a) One-level wavelet 
transform, where the N-pixel image is transformed into four sets of 
N/4 coefficients each. The top left quadrant represents the scaling 

coefficients at the next coarser scale (relative to the scale of 
pixelization). The remaining quadrants represent the wavelet 
coefficients from the difference spaces, partitioned into the vertical, 
horizontal, and diagonal subbands. (b) Three-level wavelet transform, 
where the wavelet decomposition has been iterated twice more on the 
scaling coefficients. The multiple scales of wavelet coefficients exhibit 

a parent-child dependency. The largest coefficients tend to concentrate 

at the coarsest scales and around high-frequency features such as edges 

in the image. 
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In addition to their ease of modeling, wavelets are computationally 
attractive for signal processing; using a filter bank, the wavelet transform of 
an NV-voxel signal can be computed in just O(V) operations. 


o 


Other dictionaries 


A wide variety of other dictionaries have been proposed in signal 
processing and harmonic analysis. As one example, complex-valued 
wavelet transforms have proven useful for image analysis and modeling 
[link], [link], [link], [Link], [link], [Link], [link], thanks to a phase 
component that captures location information at each scale. Just a few of 
the other harmonic dictionaries popular in image processing include 
wavelet packets [link], Gabor atoms [link], curvelets [link], [link], and 
contourlets [link], [link], all of which involve various space-frequency 
partitions. We mention additional dictionaries in Compression . 


Manifolds 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


As we will soon discuss, manifold models can provide an alternative to 
signal dictionaries as a framework for concise signal modeling. In this 
module, we present a minimal set of definitions and terminology from 
differential geometry and topology that serve as an introduction to 
manifolds. We refer the reader to the introductory and classical texts [link], 
[link], [Link], [link] for more depth and technical precision. 


General terminology 


A K-dimensional manifold .@ is a topological space[footnote] that is 
locally homeomorphic[footnote] to R“ [link]. This means that there exists 
an open cover of .@ with each such open set mapping homeomorphically to 
an open ball in R*. Each such open set, together with its mapping to R* is 
called a chart; the set of all charts of a manifold is called an atlas. 

A topological space is simply a set X, together with a collection T' of 
subsets of X called open sets, such that: (i) the empty set belongs to T’, (ii) 
X belongs to 7’, (iii) arbitrary unions of elements of 7’ belong to T, and (iv) 
finite intersections of elements of 7’ belong to T’. 

A homeomorphism is a function between two topological spaces that is 
one-to-one, onto, continuous, and has a continuous inverse. 


The general definition of a manifold makes no reference to an ambient 
space in which the manifold lives. However, as we will often be making use 
of manifolds as models for sets of signals, it follows that such “signal 
manifolds” are actually subsets of some larger space (for example, of 

L» (IR) or RY). In general, we may think of a K-dimensional submanifold 
embedded in R as a nonlinear, K-dimensional “surface” within RY. 


Examples of manifolds 


One of the simplest examples of a manifold is simply the circle in R?. A 
small, open-ended segment cut from the circle could be stretched out and 
associated with an open interval of the real line (see [link]). Hence, the 
circle is a 1-D manifold. (We note that at least two charts are required to 
form an atlas for the circle, as the entire circle itself cannot be mapped 
homeomorphically to an open interval in R?.) 


eee 


A circle is a manifold because there exists an open cover 
consisting of the sets U;, U2, which are mapped 
homeomorphically onto open intervals in the real line via the 
functions 1, 9. (It is not necessary that the intervals intersect in 
R.) 


We refer the reader to [link] for an excellent overview of several manifolds 
with relevance to signal processing, including the rotation group SO(3), 
which can be used for representing orientations of objects in 3-D space, and 
the Grassman manifold G(r, N), which represents all AK-dimensional 
subspaces of IR. (Without working through the technicalities of the 
definition of a manifold, it is easy to see that both types of data have a 
natural notion of neighborhood.) 


Tangent spaces 


A manifold is differentiable if, for any two charts whose open sets on .@ 
overlap, the composition of the corresponding homeomorphisms (from R* 


in one chart to .@ and back to R* in the other) is differentiable. (In our 
simple example, the circle is a differentiable manifold.) 


To each point z in a differentiable manifold, we may associate a K- 
dimensional tangent space ‘T'an,. For signal manifolds embedded in L or 
RX, it suffices to think of Tan, as the set of all directional derivatives of 
smooth paths on .@ through z. (Note that Tan, is a linear subspace and 
has its origin at 0, rather than at z.) 


Distances 


One is often interested in measuring distance along a manifold. For abstract 
differentiable manifolds, this can be accomplished by defining a 
Riemannian metric on the tangent spaces. A Riemannian metric is a 
collection of inner products (, ),, defined at each point x € -@. The inner 
product gives a measure for the “length” of a tangent, and one can then 
compute the length of a path on .@ by integrating its tangent lengths along 
the path. 


For differentiable manifolds embedded in RY, the natural metric is the 
Euclidean metric inherited from the ambient space. The length of a path 
y : [0,1] ++ @ can then be computed simply using the limit 
Equation: 


length (7) =lim ) / lly(é/3) — 1(@— 1)/d) le 


The geodesic distance d_y (x, y) between two points x, y € @ is then 
given by the length of the shortest path 7 on .@ joining x and y. 


Condition number 


To establish a firm footing for analysis, we find it helpful assume a certain 
regularity to the manifold beyond mere differentiability. For this purpose, 
we adopt the condition number defined recently by Niyogi et al. [link]. 


Definition 


[link] Let H be a compact submanifold of IR. The condition number of 
M is defined as 1/7, where 7 is the largest number having the following 
property: The open normal bundle about .@ of radius r is imbedded in RY 
for all r < T. 


The open normal bundle of radius r at a point x € -@ is simply the 
collection of all vectors of length < r anchored at x and with direction 
orthogonal to Tan,. 


In addition to controlling local properties (such as curvature) of the 
manifold, the condition number has a global effect as well, ensuring that the 
manifold is self-avoiding. These notions are made precise in several 
lemmata, which we repeat below for completeness. 

Lemma 


[link] If W is a submanifold of R% with condition number 1 /r, then the 
norm of the second fundamental form is bounded by 1/7 in all directions. 


This implies that unit-speed geodesic paths on .@ have curvature bounded 
by 1/7. The second lemma concerns the twisting of tangent spaces. 
Lemma 


[link] Let @ be a submanifold of R* with condition number 1/r. Let 
p,q € & be two points with geodesic distance given by d_y (p, q). Let 0 
be the angle between the tangent spaces ‘Tan, and Tan, defined by 

cos (0) =MINycTan, MAaXycTan, |(u, v)|- Then cos (0) pis tdi (p, q). 


The third lemma concerns self-avoidance of .Z. 
Lemma 


[link] Let @ be a submanifold of R* with condition number 1/r. Let 
p,q € & be two points such that ||p — q||, = d. Then for all d < 7/2, the 


geodesic distance dv (p,q) is bounded by d.yv (p,q) < tT — 71/1 — 2d/r. 


From [link] we have an immediate corollary. 


Corollary 


Let .@ be a submanifold of R* with condition number 1/r. Let p,q € @ 
be two points such that ||p — q||, = d. Ifd < 7/2, then 


2 
d>dy(p,q)—- “ee. 


Low-Dimensional Signal Models 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


We now survey some common and important models in signal processing, 
each of which involves some notion of conciseness to the signal structure. 
We see in each case that this conciseness gives rise to a low-dimensional 
geometry within the ambient signal space. 


Linear models 


Some of the simplest models in signal processing correspond to linear 
subspaces of the ambient signal space. Bandlimited signals are one such 
example. Supposing, for example, that a 27-periodic signal f has Fourier 
transform F'(w) = 0 for |w| > B, the Shannon/Nyquist sampling 
theorem [link] states that such signals can be reconstructed from 2B 
samples. Because the space of B-bandlimited signals is closed under 
addition and scalar multiplication, it follows that the set of such signals 
forms a 2B-dimensional linear subspace of L? ([0, 27)). 


Linear signal models also appear in cases where a model dictates a linear 
constraint on a signal. Considering a discrete length-N signal x, for 
example, such a constraint can be written in matrix form as 

Equation: 


Az =0 


for some M x N matrix A. Signals obeying such a model are constrained 
to live in.W(A) (again, obviously, a linear subspace of R’). 


A very similar class of models concerns signals living in an affine space, 
which can be represented for a discrete signal using 
Equation: 


Ax = y. 


The class of such z lives in a shifted nullspace + VW (A), where Z is any 
solution to the equation Az = y. 


Revisiting the dictionary setting (see Signal Dictionaries and 
Representations), one last important linear model arises in cases where we 
select K specific elements from the dictionary W and then construct signals 
using linear combinations of only these K elements; in this case the set of 
possible signals forms a K-dimensional hyperplane in the ambient signal 
space (see [link](a)). 


Simple models for signals in R?. (a) The linear space spanned by one 
element of the dictionary YW. The bold vectors denote the elements of 
the dictionary, while the dashed line (plus the corresponding dictionary 
element) denotes the subspace spanned by that dictionary element. 
(b) The nonlinear set of 1-sparse signals that can be built using VW. 
(c) A manifold .Z. 


x(2) 


x(1) 


For example, we may construct low-frequency signals using combinations 
of only the lowest frequency sinusoids from the Fourier dictionary. Similar 
subsets may be chosen from the wavelet dictionary; in particular, one may 
choose only elements that span a particular scaling space V;. As we have 
mentioned previously, harmonic dictionaries such as sinusoids and wavelets 
are well-suited to representing smooth[footnote] signals. This can be seen in 
the decay of their transform coefficients. For example, we can relate the 
smoothness of a continuous 1-D function f to the decay of its Fourier 
coefficients F'(w); in particular, if f |F (w)|(1+ |w|#)dw < oo, then 

f © @* [link]. In order to satisfy f JF (w)|(1 + Jw |) dw < 00, a signal 


must have a sufficiently fast decay of the Fourier transform coefficients 
|F (w)| as w grows. Wavelet coefficients exhibit a similar decay for smooth 
signals: supposing f € @” and the wavelet basis function has at least H 
vanishing moments, then as the scale 7 — oo, the magnitudes of the 
wavelet coefficients decay as goa +1/2) [link]. (Recall that f € oe 
implies f is well-approximated by a polynomial, and so due the vanishing 
moments this polynomial will have zero contribution to the wavelet 
coefficients.) 

Lipschitz smoothness We say a continuous-time function of D variables 
has smoothness of order H > 0, where H = r + p, r is an integer, and 

vy € (0, 1], if the following criteria are met [link], [link]: 


e All iterated partial derivatives with respect to the D directions up to 
order 7 exist and are continuous. 

e All such partial derivatives of order r satisfy a Lipschitz condition of 
order v (also known as a Hdlder condition).(A function d € Lip(v) if 
|d (t; + t2) — d(t,)|< C || te ||” for all D-dimensional vectors ty, t2.) 


We will sometimes consider the space of smooth functions whose partial 
derivatives up to order r are bounded by some constant {2. With somewhat 
nonstandard notation, we denote the space of such bounded functions with 
bounded partial derivatives by @“, where this notation carries an implicit 
dependence on 2. Observe that r = | H — 1], where |-] denotes rounding 
up. Also, when H is an integer @ 7 includes as a subset the space 
traditionally denoted by the notation “C”” (the class of functions that have 
Hf =r-+1 continuous partial derivatives). 


Indeed, these results suggest that the largest Fourier or wavelet coefficients 
of smooth signals tend to concentrate at the coarsest scales (lowest- 

linear approximations formed from just the lowest frequency elements of 
the Fourier or wavelet dictionaries (i.e., the truncation of the Fourier or 
wavelet representation to only the lowest frequency terms) provide very 
accurate approximations to smooth signals. Put differently, smooth signals 
live near the subspace spanned by just the lowest frequency Fourier or 
wavelet basis functions. 


Sparse (nonlinear) models 


Sparse signal models can be viewed as a generalization of linear models. 
The notion of sparsity comes from the fact that, by the proper choice of 
dictionary YW, many real-world signals = Wa have coefficient vectors a 
containing few large entries, but across different signals the locations 
(indices in a) of the large entries may change. We say a signal is strictly 
sparse (or “A --sparse”) if all but K entries of a@ are zero. 


Some examples of real-world signals for which sparse models have been 
proposed include neural spike trains (in time), music and other audio 
recordings (in time and frequency), natural images (in the wavelet or 
curvelet dictionaries [link], [link], [link], [link], [link], [link], [link], [link]), 
video sequences (in a 3-D wavelet dictionary [link], [link]), and sonar or 
radar pulses (in a chirplet dictionary [link]). In each of these cases, the 
relevant information in a sparse representation of a signal is encoded in 
both the locations (indices) of the significant coefficients and the values to 
which they are assigned. This type of uncertainty is an appropriate model 
for many natural signals with punctuated phenomena. 


Sparsity is a nonlinear model. In particular, let 'x denote the set of all K- 
sparse signals for a given dictionary. It is easy to see that the set >’ is not 
closed under addition. (In fact, X};% + X77 = 2x.) From a geometric 
perspective, the set of all K-sparse signals from the dictionary Y forms not 
a hyperplane but rather a union of k-dimensional hyperplanes, each 
spanned by K vectors of W (see [link](b)). For a dictionary Y with Z 


entries, there are ( : such hyperplanes. (The geometry of sparse signal 


collections has also been described in terms of orthosymmetric sets; 
see [link].) 


Signals that are not strictly sparse but rather have a few “large” and many 
“small” coefficients are known as compressible signals. The notion of 
compressibility can be made more precise by considering the rate at which 
the sorted magnitudes of the coefficients a decay, and this decay rate can in 
turn be related to the £, norm of the coefficient vector a. Letting @ denote a 
rearrangement of the vector a with the coefficients ordered in terms of 
decreasing magnitude, then the reordered coefficients satisfy [link] 


Equation: 


a < | og”, 


decay rates play an important role in nonlinear approximation, where 
adaptive, K-sparse representations from the dictionary are used to 
approximate a signal. 


We recall from [link] that for a smooth signal f, the largest Fourier and 
wavelet coefficients tend to cluster at coarse scales (low frequencies). 
Suppose, however, that the function f is piecewise smooth; i.e., it is CG at 
every point t € IR except for one point fo, at which it is discontinuous. 
Naturally, this phenomenon will be reflected in the transform coefficients. 
In the Fourier domain, this discontinuity will have a global effect, as the 
overall smoothness of the function f has been reduced dramatically from H 
to 0. Wavelet coefficients, however, depend only on local signal properties, 
and so the wavelet basis functions whose supports do not include to will be 
unaffected by the discontinuity. Coefficients surrounding the singularity 
will decay only as 2-4/2, but there are relatively few such coefficients. 
Indeed, at each scale there are only O(1) wavelets that include to in their 
supports, but these locations are highly signal-dependent. (For modeling 
purposes, these significant coefficients will persist through scale down the 
parent-child tree structure.) After reordering by magnitude, the wavelet 
coefficients of piecewise smooth signals will have the same general decay 
rate as those of smooth signals. In Nonlinear Approximation from 
Approximation, we see that the quality of nonlinear approximations offered 
by wavelets for smooth 1-D signals is not hampered by the addition of a 
finite number of discontinuities. 


Manifold models 


Manifold models generalize the conciseness of sparsity-based signal 
models. In particular, in many situations where a signal is believed to have a 
concise description or “few degrees of freedom,” the result is that the signal 
will live on or near a particular submanifold of the ambient signal space. 


Parametric models 


We begin with an abstract motivation for the manifold perspective. 
Consider a signal f (such as a natural image), and suppose that we can 
identify some single 1-D piece of information about that signal that could 
be variable; that is, other signals might rightly be called “similar” to f if 
they differ only in this piece of information. (For example, this 1-D 
parameter could denote the distance from some object in an image to the 
camera.) We let 8 denote the variable parameter and write the signal as fg to 
denote its dependence on @. In a sense, 0 is a single “degree of freedom” 
driving the generation of the signal fg under this simple model. We let O 
denote the set of possible values of the parameter 0. If the mapping between 
6 and fg is well-behaved, then the collection of signals { fg : 9 € O} forms 
a 1-D path in the ambient signal space. 


More generally, when a signal has AK degrees of freedom, we may model it 
as depending on some parameter @ that is chosen from a K-dimensional 
manifold ©. (The parameter space @ could be, for example, a subset of R*, 
or it could be a more general manifold such as SO(3).) We again let fg 
denote the signal corresponding to a particular choice of 0, and we let 

F ={fo: 0 € O}. Assuming the mapping f is continuous and injective 
over O (and its inverse is continuous), then by virtue of the manifold 
structure of O, its image ¥ will correspond to a k-dimensional manifold 
embedded in the ambient signal space (see [link ](c)). 


These types of parametric models arise in a number of scenarios in signal 
processing. Examples include: signals of unknown translation, sinusoids of 
unknown frequency (across a continuum of possibilities), linear radar chirps 
described by a starting and ending time and frequency, tomographic or light 
field images with articulated camera positions, robotic systems with few 
physical degrees of freedom, dynamical systems with low-dimensional 
attractors [link], [link], and so on. 


In general, parametric signals manifolds are nonlinear (by which we mean 
non-affine as well); this can again be seen by considering the sum of two 
signals fg, + fo,. In many interesting situations, signal manifolds are non- 
differentiable as well. 


Nonparametric models 


Manifolds have also been used to model signals for which there is no 
known parametric model. Examples include images of faces and 
handwritten digits [link], [link], which have been found empirically to 
cluster near low-dimensional manifolds. Intuitively, because of the 
configurations of human joints and muscles, it may be conceivable that 
there are relatively “few” degrees of freedom driving the appearance of a 
human face or the style of handwriting; however, this inclination is difficult 
or impossible to make precise. Nonetheless, certain applications in face and 
handwriting recognition have benefitted from algorithms designed to 
discover and exploit the nonlinear manifold-like structure of signal 
collections. Manifold Learning from Dimensionality Reduction discusses 
such methods for learning parametrizations and other information from data 
living along manifolds. 


Much more generally, one may consider, for example, the set of all natural 
images. Clearly, this set has small volume with respect to the ambient signal 
space — generating an image randomly pixel-by-pixel will almost certainly 
produce an unnatural noise-like image. Again, it is conceivable that, at least 
locally, this set may have a low-dimensional manifold-like structure: from a 
given image, one may be able to identify only a limited number of 
meaningful changes that could be performed while still preserving the 
natural look to the image. Arguably, most work in signal modeling could be 
interpreted in some way as a Search for this overall structure. 


Approximation 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


To this point, we have discussed signal representations and models as basic 
tools for signal processing. In the following modules, we discuss the actual 
application of these tools to tasks such as approximation and compression, 
and we continue to discuss the geometric implications. 


Linear approximation 


One common prototypical problem in signal processing is to find the best 
linear approximation to a signal x. By “best linear approximation,” we 
mean the best approximation to z from among a class of signals comprising 
a linear (or affine) subspace. This situation may arise, for example, when 
we have a noisy observation of a signal believed to obey a linear model. If 
we choose an £2 error criterion, the solution to this optimization problem 
has a particularly strong geometric interpretation. 


To be more concrete, suppose S is a K-dimensional linear subspace of R™. 
(The case of an affine subspace follows similarly.) If we seek 


Equation: 


* ° 
s :=argmin ||s — 2[>, 


standard linear algebra results state that the minimizer is given by 
Equation: 


$= A Az, 


where A is a K x N matrix whose rows form an orthonormal basis for S. 
Geometrically, one can easily see that this solution corresponds to an 


orthogonal projection of x onto the subspace S (see [link](a)). 


Approximating a signal 2 € R? with an @g error criterion. (a) Linear 
approximation using one element of the dictionary W corresponds to 
orthogonal projection of the signal onto the linear subspace. 

(b) Nonlinear approximation corresponds to orthogonal projection of 
the signal onto the nearest candidate subspace. In this case, we choose 
the best 1-sparse signal that can be built using W. (c) Manifold-based 
approximation, finding the nearest point on .@. 


x(2) 


The linear approximation problem arises frequently in settings involving 
signal dictionaries. In some settings, such as the case of an oversampled 
bandlimited signal, certain coefficients in the vector a may be assumed to 
be fixed at zero. In the case where the dictionary W forms an orthonormal 
basis, the linear approximation estimate of the unknown coefficients has a 
particularly simple form: rows of the matrix A in [link] are obtained by 
selecting and transposing the columns of Y whose expansion coefficients 
are unknown, and consequently, the unknown coefficients can be estimated 
simply by taking the inner products of x against the appropriate columns of 
WV 


For example, in choosing a fixed subset of the Fourier or wavelet 
dictionaries, one may rightfully choose the lowest frequency (coarsest 
scale) basis functions for the set S' because, as discussed in Linear Models 
from Low-Dimensional Signal Models , the coefficients generally tend to 
decay at higher frequencies (finer scales). For smooth functions, this 
strategy is appropriate and effective; functions in Sobolev smoothness 
Spaces are well-approximated using linear approximations from the Fourier 
or wavelet dictionaries [link]. For piecewise smooth functions, however, 
even the wavelet-domain linear approximation strategy would miss out on 
significant coefficients at fine scales. Since the locations of such 
coefficients are unknown a priority, it is impossible to propose a linear 
wavelet-domain approximation scheme that could simultaneously capture 
all piecewise smooth signals. As an example, [link](a) shows the linear 
approximation of the Cameraman test image obtained by keeping only the 
lowest-frequency scaling and wavelet coefficients. No high-frequency 
information is available to clearly represent features such as edges. 


Linear versus nonlinear approximation in the wavelet domain. (a) 
Linear approximation of the Cameraman test image obtained by 
keeping the kK = 4096 lowest-frequency wavelet coefficients from the 
five-level wavelet decomposition. The MSE with respect to the 
original image is 353. (b) Nonlinear approximation of the Cameraman 
test image obtained by keeping the K = 4096 largest wavelet 
coefficients from the five-level wavelet decomposition. The MSE with 
respect to the original image is 72. Compared with linear 
approximation, more high frequency coefficients are included, which 
allows better representation of features such as edges. 
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Nonlinear approximation 


A related question often arises in settings involving signal dictionaries. 
Rather than finding the best approximation to a signal f using a fixed 
collection of K elements from the dictionary Y, one may often seek the best 
K-term representation to f among all possible expansions that use K terms 


from the dictionary. Compared to linear approximation, this type of 
nonlinear approximation [link], [link] utilizes the ability of the dictionary to 
adapt: different elements may be important for representing different 
signals. 


The /K-term nonlinear approximation problem corresponds to the 
optimization 
Equation: 


* 
$x» ‘=argmin || s — f |l,. 


(For the sake of generality, we consider general L,, and £, norms in this 
section.) Due to the nonlinearity of the set »'x for a given dictionary, 
solving this problem can be difficult. Supposing W is an orthonormal basis 
and p = 2, the solution to [link] is easily obtained by thresholding: one 
simply computes the coefficients a@ and keeps the K largest (setting the 
remaining coefficients to zero). The approximation error is then given 
simply by the coefficients that are discarded: 

Equation: 


{2 


* 
| Sko—f lle = > ay 


k>K 


When W is a redundant dictionary, however, the situation is much more 
complicated. We mention more on this below (see also [link](b)). 


Measuring approximation quality 


One common measure for the quality of a dictionary VY in approximating a 
signal class is the fidelity of its A-term representations. Often one examines 
the asymptotic rate of decay of the K-term approximation error as K grows 
large. Defining 

Equation: 


oK(f)p =ll 8kp— fll, 


for a given signal f we may consider the asymptotic decay of ox(f) p as 


K — oo. (We recall the dependence of [link] and hence [link] on the 
dictionary Y.) In many cases, the function ox(f) p Will decay as K ~" for 


some 7, and when W represents a harmonic dictionary, faster decay rates 
tend to correspond to smoother functions. Indeed, one can show that when 
W is an orthonormal basis, then ox(f),. will decay as kK ~" if and only if a, 
decays as k~"*1/? [link]. 


Nonlinear approximation of piecewise smooth functions 


Let f € @" bea 1-D function. Supposing the wavelet dictionary has more 
than H vanishing moments, then f can be well approximated using its K 
largest coefficients (most of which are at coarse scales). As K grows large, 
the nonlinear approximation error will decay[footnote] as ox(f), < K~”. 
We use the notation f(@) S g(a), or f(a) = O(g(a)), if there exists a 
constant C’,, possibly large but not dependent on the argument a, such that 


f(a) < Co(a). 


Supposing that f is piecewise smooth, however, with a finite number of 
discontinuities, then (as discussed in Sparse (Nonlinear) Models from Low- 
Dimensional Signal Models) f will have a limited number of significant 
wavelet coefficients at fine scales. Because of the concentration of these 
significant coefficients within each scale, the nonlinear approximation rate 
will remain ox(f), < K~™ as if there were no discontinuities present 
[link]. 


Unfortunately, this resilience of wavelets to discontinuities does not extend 
to higher dimensions. Suppose, for example, that f isa @ smooth 2-D 
signal. Assuming the proper number of vanishing moments, a wavelet 
representation will achieve the optimal nonlinear approximation rate 
ox(f), S$ K~”’ [link], [link]. As in the 1-D case, this approximation rate 


is maintained when a finite number of point discontinuities are introduced 
into f. However, when f contains 1-D discontinuities (edges separating the 
smooth regions), the approximation rate will fall toox(f), < K =e 
[link]. The problem actually arises due to the isotropic, dyadic supports of 
the wavelets; instead of O(1) significant wavelets at each scale, there are 
now O(2/ ) wavelets overlapping the discontinuity. We revisit this 
important issue in Compression. 


Despite the limited approximation capabilities for images with edges, 
nonlinear approximation in the wavelet domain typically offers a superior 
approximation to an image compared to linear approximation in the wavelet 
domain. As an example, [link](b) shows the nonlinear approximation of the 
Cameraman test image obtained by keeping the largest scaling and wavelet 
coefficients. In this case, a number of high-frequency coefficients are 
selected, which gives an improved ability to represent features such as 
edges. Better concise transforms, which capture the image information in 
even fewer coefficients, would offer further improvements in terms of 
nonlinear approximation quality. 


Finding approximations 


As mentioned above, in the case where W is an orthonormal basis and p = 2 
, the solution to [link] is easily obtained by thresholding: one simply 
computes the coefficients a and keeps the K largest (setting the remaining 
coefficients to zero). Thresholding can also be shown to be optimal for 
arbitrary £, norms in the special case where W is the canonical basis. While 
the optimality of thresholding does not generalize to arbitrary norms and 
bases, thresholding can be shown to be a near-optimal approximation 
strategy for wavelet bases with arbitrary L,, norms [link]. 


In the case where W is a redundant dictionary, however, the expansion 
coefficients @ are not unique, and the optimization problem [link] can be 
much more difficult to solve. Indeed, supposing even that an exact K-term 
representation exists for f in the dictionary VY, finding that K-term 
approximation is NP-hard in general, requiring a combinatorial enumeration 


of the ( x) possible sparse subspaces [link]. This search can be recast as 


the optimization problem 
Equation: 


@ =argmin || a ||) s.t. f = Wa. 


While solving [link] is prohibitively complex, a variety of algorithms have 
been proposed as alternatives. One approach convexifies the optimization 
problem by replacing the £9 fidelity criterion by an @; criterion 

Equation: 


@=argmin || a ||, s.t. f = Wa. 


This problem, known as Basis Pursuit [link], is significantly more 
approachable and can be solved with traditional linear programming 
techniques whose computational complexities are polynomial in Z. The £; 
criterion has the advantage of yielding a convex optimization problem 
while still encouraging sparse solutions due to the polytope geometry of the 
£, unit ball (see for example [link] and [link]). Iterative greedy algorithms 
such as Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP) 
[link] have also been suggested to find sparse representations a for a signal 
f. Both MP and OMP iteratively select the columns from W that are most 
correlated with f, then subtract the contribution of each column, leaving a 
residual. OMP includes an additional step at each iteration where the 
residual is orthogonalized against the previously selected columns. 


Manifold approximation 


We also consider the problem of finding the best manifold-based 
approximation to a signal (see [link](c)). Suppose that F = {fg : 0 € O} 
is a parametrized A-dimension manifold and that we are given a signal [ 
that is believed to approximate f» for an unknown @ € O. From I we wish 
to recover an estimate of 9. Again, we may formulate this parameter 
estimation problem as an optimization, writing the objective function (here 
we concentrate solely on the LD or £2 case) 


Equation: 


D (9) = |lfo — Tl3 


and solving for 
Equation: 


6 =argmin D (9). 


We suppose that the minimum is uniquely defined. 


Standard nonlinear parameter estimation [link] tells us that, if D is 
differentiable, we can use Newton's method to iteratively refine a sequence 
of guesses 6 9) 6)... to 6” and rapidly convergence to the true 
value. Supposing that ¥ is a differentiable manifold, we would let 
Equation: 


J =(8D/00) 8D/00; --- OD/00x-1\" 


be the gradient of D, and let H be the K x K Hessian, H;; = ah ; 
iOU; 


Assuming D is differentiable, Newton's method specifies the following 
update step: 
Equation: 


a) <9) + | (6) “3 (o). 


To relate this method to the structure of the manifold, we can actually 
express the gradient and Hessian in terms of signals, writing 
Equation: 


D(6) =|| fe—I |I =f (fe-1 d= f f3-21fo+P dz. 


Differentiating with respect to component 6;, we obtain 


Equation: 
OD O 9 9 
0, —— a 26, ( / fo —2I fo +] i) 


O O 
= / 2feTs — 217, dx 


2( fo — 1,79), 


where oe = ts is a tangent signal. Continuing, we examine the Hessian, 


Equation: 
07D d (OD 
——- —- H; = ——-(— 
06,00; 00; \ 00; 
0 i i 
| oe (2for4 = 2174) dx 
= [rir + fore = 217; dx 


= 2 (ri, 74) +2(fo—I,74'), 


i 2 
where o = a fe denotes a second-derivative signal. Thus, we can 
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interpret Newton's method geometrically as (essentially) a sequence of 
successive projections onto tangent spaces on the manifold. 


Again, the above discussion assumes the manifold to be differentiable. 
Many interesting parametric signal manifolds are in fact nowhere 
differentiable — the tangent spaces demanded by Newton's method do not 
exist. However, in [link] we have identified a type of multiscale tangent 


structure to the manifold that permits a coarse-to-fine technique for 
parameter estimation. 


Compression 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


Transform coding 


of a dictionary in terms of its A-term approximations to signals drawn from 
some class. One reason that such approximations are desirable is that they 
provide concise descriptions of the signal that can be easily stored, 
processed, etc. There is even speculation and evidence that neurons in the 
human visual system may use sparse coding to represent a scene [link]. 


For data compression, conciseness is often exploited in a popular technique 
known as transform coding. Given a signal f (for which a concise 
description may not be readily apparent in its native domain), the idea is 
simply to use the dictionary W to transform f to its coefficients a@, which 
can then be efficiently and easily described. As discussed above, perhaps 
the simplest strategy for summarizing a sparse a is simply to threshold, 
keeping the K largest coefficients and discarding the rest. A simple encoder 
would then just encode the positions and quantized values of these K’ 
coefficients. 


Metric entropy 


Suppose f is a function and let fr be an approximation to f encoded using 
R bits. To evaluate the quality of a coding strategy, it is common to 
consider the asymptotic rate-distortion (R-D) performance, which 


measures the decay rate of || f —fr I|z, as R — oo. The metric 

entropy [link] for a class F gives the best decay rate that can be achieved 
uniformly over all functions f € ¥. We note that this is a true measure for 
the complexity of a class and is tied to no particular dictionary or encoding 
strategy. The metric entropy also has a very geometric interpretation, as it 


relates to the smallest radius possible for a covering of 2” balls over the set 


F. 


Metric entropies are known for certain signal classes. For example, the 
results of Clements [link] (extending those of Kolmogorov and 

Tihomirov [link]) regarding metric entropy give bounds on the optimal 
achievable asymptotic rate-distortion performance for D-dimensional @ #- 
smooth functions f (see also [link]): 

Equation: 


dls 


— 1 
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Rate-distortion performance measures the complexity of a representation 
and encoding strategy. In the case of transform coding, for example, R-D 
results account for the bits required to encode both the values of the 
significant coefficients and their locations. Nonetheless, in many cases 
transform coding is indeed an effective strategy for encoding signals that 
have sparse representations [link]. For example, in [link] Cohen et al. 
propose a wavelet-domain coder that uses a connected-tree structure to 
efficiently encode the positions of the significant coefficients and prove that 
this encoding strategy achieves the optimal rate 

Equation: 


ols 
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Compression of piecewise smooth images 


In some cases, however, the sparsity of the wavelet transform may not 
reflect the true underlying structure of a signal. Examples are 2-D piecewise 
smooth signals with a smooth edge discontinuity separating the smooth 


wavelets fail to sparsely represent these functions, and so the R-D 


performance for simple thresholding-based coders will suffer as well. In 
spite of all of the benefits of wavelet representations for signal processing 
(low computational complexity, tree structure, sparse approximations for 
smooth signals), this failure to efficiently represent edges is a significant 
drawback. In many images, edges carry some of the most prominent and 
important information [link], and so it is desirable to have a representation 
well-suited to compressing edges in images. 


To address this concern, recent work in harmonic analysis has focused on 
developing representations that provide sparse decompositions for certain 
geometric image classes. Examples include curvelets [link], [link] and 
contourlets [link], slightly redundant tight frames consisting of anisotropic, 
“needle-like” atoms. In [link], bandelets are formed by warping an 
orthonormal wavelet basis to conform to the geometrical structure in the 
image. A nonlinear multiscale transform that adapts to discontinuities (and 
can represent a “clean” edge using very few coarse scale coefficients) is 
proposed in [link]. Each of these new representations has been shown to 
achieve near-optimal asymptotic approximation and R-D performance for 
piecewise smooth images consisting of C# regions separated by 
discontinuities along C# curves, with H = 2 (H > 2 for bandelets). Some 
have also found use in specialized compression applications such as 
identification photos [link]. 


In [link], we have presented a scheme that is based on the simple yet 
powerful observation that geometric features can be efficiently 
approximated using local, geometric atoms in the spatial domain, and that 
the projection of these geometric primitives onto wavelet subspaces can 
therefore approximate the corresponding wavelet coefficients. We prove 
that the resulting dictionary achieves the optimal nonlinear approximation 
rates for piecewise smooth signal classes. To account for the added 
complexity of this encoding strategy, we also consider R-D results and 
prove that this scheme comes within a logarithmic factor of the optimal 
performance rate. Unlike the techniques mentioned above, our method also 
generalizes to arbitrary orders of smoothness and arbitrary signal 
dimension. 


Dimensionality Reduction 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


Recent years have seen a proliferation of novel techniques for what can 
loosely be termed “dimensionality reduction.” Like the tasks of 
approximation and compression discussed above, these methods involve 
some aspect in which low-dimensional information is extracted about a 
signal or collection of signals in some high-dimensional ambient space. 
Unlike the tasks of approximation and compression, however, the goal of 
these methods is not always to maintain a faithful representation of each 
signal. Instead, the purpose may be to preserve some critical relationships 
among elements of a data set or to discover information about a manifold 
on which the data lives. 


In this section, we review two general methods for dimensionality 
reduction. [link] begins with a brief overview of techniques for manifold 
learning. [link] then discusses the Johnson-Lindenstrauss (JL) lemma, 
which concerns the isometric embedding of a cloud points as it is projected 
to a lower-dimensional space. Though at first glance the JL lemma does not 
pertain to any of the low-dimensional signal models we have previously 
discussed, we later see in Connections with dimensionality reduction that 
the JL lemma plays a critical role in the core theory of CS, and we also 
employ the JL lemma in developing a theory for isometric embeddings of 
manifolds. 


Manifold learning 


Several techniques have been proposed for solving a problem known as 
manifold learning in which certain properties of a manifold are inferred 
from a discrete collection of points sampled from that manifold. A typical 
manifold learning setup is as follows: an algorithm is presented with a set of 
P points sampled from a K-dimensional submanifold of R. The goal of 
the algorithm is to produce an mapping of these P points into some lower 


dimension R™ (ideally, M = K) while preserving some characteristic 
property of the manifold. Example algorithms include ISOMAP [link], 
Hessian Eigenmaps (HLLE) [link], and Maximum Variance Unfolding 
(MVU) [link], which attempt to learn isometric embeddings of the manifold 
(thus preserving pairwise geodesic distances in R™); Locally Linear 
Embedding (LLE) [link], which attempts to preserve local linear 
neighborhood structures among the embedded points; Local Tangent Space 
Alignment (LTSA) [link], which attempts to preserve local coordinates in 
each tangent space; and a method for charting a manifold [link] that 
attempts to preserve local neighborhood structures. 


The internal mechanics of these algorithms differs depending on the 
objective criterion to be preserved, but as an example, the ISOMAP 
algorithm operates by first estimating the geodesic distance between each 
pair of points on the manifold (by approximating geodesic distance as the 
sum of Euclidean distances between pairs of the available sample points). 
After the P x P matrix of pairwise geodesic distances is constructed, a 
technique known as multidimensional scaling uses an eigendecomposition 
of the distance matrix to determine the proper /-dimensional embedding 
space. An example of using ISOMAP to learn a 2-dimensional manifold is 
shown in [link]. 


Manifold learning demonstration. (a) As input to the manifold 
learning algorithm, 1000 images of size 64 x 64 are created, where 
each image consists of a white disk translated to a random position 

(01, 02). It follows that the images represent a sampling of 1000 
points from a 2-dimensional submanifold of R*°%*. (b) Scatter plot of 
the true values for the (01, 62) positions. For visibility in each plot, 
the color of each point indicates the true 0; value. (c) ISOMAP 


embedding learned from original data points in R4°°®, From the low- 
dimensional embedding coordinates we can infer the relative 
positions of the original high-dimensional images. (d) ISOMAP 
embedding learned from a random projection of the data set to R™, 
where M = 15. 


These algorithms can be useful for learning the dimension and 
parametrizations of manifolds, for sorting data, for visualization and 
navigation through the data, and as preprocessing to make further analysis 
more tractable; common demonstrations include analysis of face images 
and classification of and handwritten digits. A related technique, the 
Whitney Reduction Network [link], [link], seeks a linear mapping to R” 
that preserves ambient pairwise distances on the manifold and is 
particularly useful for processing the output of dynamical systems having 
low-dimensional attractors. 


Other algorithms have been proposed for characterizing manifolds from 
sampled data without constructing an explicit embedding in R™. The 
Geodesic Minimal Spanning Tree (GMST) [link] models the data as 
random samples from the manifold and estimates the corresponding entropy 
and dimensionality. Another technique [link] has been proposed for using 
random samples of a manifold to estimate its homology (via the Betti 
numbers, which essentially characterize its dimension, number of connected 
components, etc.). Persistence Barcodes [link] are a related technique that 
involves constructing a type of signature for a manifold (or simply a shape) 
that uses tangent complexes to detect and characterize local edges and 
comers. 


Additional algorithms have been proposed for constructing meaningful 
functions on the point samples in R™. To solve a semi-supervised learning 
problem, a method called Laplacian Eigenmaps [link] has been proposed 
that involves forming an adjacency graph for the data in RY, computing 
eigenfunctions of the Laplacian operator on the graph (which form a basis 
for Lz on the graph), and using these functions to train a classifier on the 
data. The resulting classifiers have been used for handwritten digit 


recognition, document classification, and phoneme classification. (The MZ 
smoothest eigenfunctions can also be used to embed the manifold in M, 
similar to the approaches described above.) A related method called 
Diffusion Wavelets [link] uses powers of the diffusion operator to model 
scale on the manifold, then constructs wavelets to capture local behavior at 
each scale. The result is a wavelet transform adapted not to geodesic 
distance but to diffusion distance, which measures (roughly) the number of 
paths connecting two points. 


The Johnson-Lindenstrauss lemma 


Fundamentals 


As with the above techniques in manifold learning, the Johnson- 
Lindenstrauss (JL) lemma [link], [link], [link], [link] provides a method for 
dimensionality reduction of a set of data in R’. Unlike manifold-based 
methods, however, the JL lemma can be used for any arbitrary set Q of 
points in RX; the data set is not assumed to have any a priori structure. 


Despite the apparent lack of structure in an arbitrary point cloud data set, 
the JL lemma suggests that there does exist a method for dimensionality 
reduction of that data set that can preserve key information while mapping 
the data to a lower-dimensional space R™. In particular, the original 
formulation of the JL lemma [link] states that there exists a Lipschitz 
mapping  : RY +> R™” with M = O(log (#Q)) such that all pairwise 
distances between points in Q are approximately preserved. This fact is 
useful for solving problems such as Approximate Nearest 

Neighbor [link], in which one desires the nearest point in @ to some query 
point y € R% (but a solution not much further than the optimal point is also 
acceptable). Such problems can be solved significantly more quickly in R™ 
than in RY. 


Recent reformulations of the JL lemma propose random linear operators 
that, with high probability, will ensure a near isometric embedding. These 
typically build on concentration of measure results such as the following. 
Lemma 


[link], [link] Let « € RY, fix 0 < e€ < 1, and let & bea matrix constructed 
in one of the following two manners: 


1.@isarandom M x N matrix with i.i.d. WY (0, o”) entries, where 
o” = 1/N, or 
2. & is random orthoprojector from R% to R™. 


Then with probability exceeding 


Equation: 
( M (e?/2 — e?/3) 
1 —2exp | -————_.———_ ], 
2 

the following holds: 
Equation: 

M . |\€z\l, M 

(1—e)4/— < < (1+e),/—. 
N ~ |zllo N 


The random orthoprojector referred to above is clearly related to the first 
case (simple matrix multiplication by a Gaussian #) but subtly different; 
one could think of constructing a random Gaussian @, then using Gram- 
Schmidt to orthonormalize the rows before multiplying x. We note also that 


simple rescaling of ® can be used to eliminate the i/ A in [link]; however 


we prefer this formulation for later reference. 


By using the union bound over all ey pairs of distinct points in Q, 


Lemma "The Johnson-Lindenstrauss lemma” can be used to prove a 
randomized version of the Johnson-Lindenstrauss lemma. 

Lemma 

Johnson-Lindenstrauss 


Let Q be a finite collection of points in R”. Fix 0 < e < land 8 > 0. Set 


Equation: 


M > (a) In (#Q). 


Let & be a matrix constructed in one of the following two manners: 


1.@isarandom M x N matrix with i.i.d. YW (0, a”) entries, where 
o =1/N,or 
2. @ is random orthoprojector from R™ to R™. 


Then with probability exceeding 1 — (#Q)° , the following statement 
holds: for every xz, y € Q, 
Equation: 


M _ |\ér-— M 
a—oyfM < Wen Fuh capo /M 
N ~~ |e—gllp N 


Indeed, [link] establishes that both [link] and [link] also hold when the 
elements of @ are chosen i.i.d. from a random Rademacher distribution (+o 
with equal probability 1/2) or from a similar ternary distribution (+ /30 
with equal probability 1/6; 0 with probability 2/3). These can further 
improve the computational benefits of the JL lemma. 


Connections with compressed sensing 


In the following module on Compressed Sensing we will discuss further 
topics in dimensionality reduction that relate to the JL lemma. In particular, 
as discussed in Connections with dimensionality reduction, the core 
mechanics of Compressed Sensing can be interpreted in terms of a stable 
embedding that arises for the family of AK-sparse signals when observed 
with random measurements, and this stable embedding can be proved using 
the JL lemma. Furthermore, as discussed in Stable embeddings of 


manifolds, one can ensure a stable embedding of families of signals 
obeying manifold models under a sufficient number of random projections, 
with the theory again following from the JL lemma. 


Compressed Sensing 

This collection reviews fundamental concepts underlying the use of concise 
models for signal processing. Topics are presented from a geometric 
perspective and include low-dimensional linear, sparse, and manifold-based 
signal models, approximation, compression, dimensionality reduction, and 
Compressed Sensing. 


A new theory known as Compressed Sensing (CS) has recently emerged 
that can also be categorized as a type of dimensionality reduction. Like 
manifold learning, CS is strongly model-based (relying on sparsity in 
particular). However, unlike many of the standard techniques in 
dimensionality reduction (such as manifold learning or the JL lemma), the 
goal of CS is to maintain a low-dimensional representation of a signal x 
from which a faithful approximation to z can be recovered. In a sense, this 
more closely resembles the traditional problem of data compression (see 
Compression). In CS, however, the encoder requires no a priori knowledge 
of the signal structure. Only the decoder uses the model (sparsity) to 
recover the signal. We justify such an approach again using geometric 
arguments. 


Motivation 


Consider a signal z € R%, and suppose that the basis Y provides a K- 
sparse representation of x 
Equation: 


x = Wa, 


with || a ||) = K. (In this section, we focus on exactly K-sparse signals, 
though many of the key ideas translate to compressible signals [link], [link]. 
In addition, we note that the CS concepts are also extendable to tight 
frames.) 


As we discussed in Compression, the standard procedure for compressing 
sparse signals, known as transform coding, is to (i) acquire the full N- 
sample signal x; (ii) compute the complete set of transform coefficients a; 


(iii) locate the K largest, significant coefficients and discard the (many) 
small coefficients; (iv) encode the values and locations of the largest 
coefficients. 


This procedure has three inherent inefficiencies: First, for a high- 
dimensional signal, we must start with a large number of samples NV. 
Second, the encoder must compute all N of the transform coefficients a, 
even though it will discard all but A of them. Third, the encoder must 
encode the locations of the large coefficients, which requires increasing the 
coding rate since the locations change with each signal. 


Incoherent projections 


This raises a simple question: For a given signal, is it possible to directly 
estimate the set of large a(m)'s that will not be discarded? While this seems 
improbable, Candés, Romberg, and Tao [link], [link] and Donoho [link] 
have shown that a reduced set of projections can contain enough 
information to reconstruct sparse signals. An offshoot of this work, often 
referred to as Compressed Sensing (CS) [link], [link], [link], [link], [link], 
[link], [link], has emerged that builds on this principle. 


In CS, we do not measure or encode the K significant a(n) directly. 
Rather, we measure and encode M < N projections y(m) =< 2, Ym! > 
of the signal onto a second set of functions {y,,},m = 1, 2,..., M.In 
matrix notation, we measure 

Equation: 


y= $2, 


where y is an / x 1 column vector and the measurement basis matrix 
is M x N with each row a basis vector ¢,,,. Since M < N, recovery of the 
signal x from the measurements y is ill-posed in general; however the 
additional assumption of signal sparsity makes recovery possible and 
practical. 


The CS theory tells us that when certain conditions hold, namely that the 
functions {y,,} cannot sparsely represent the elements of the basis {w,,} (a 
condition known as incoherence of the two dictionaries [link], [link], 
[link], [link]) and the number of measurements / is large enough, then it is 
indeed possible to recover the set of large {a:(m) } (and thus the signal x) 
from a similarly sized set of measurements y. This incoherence property 
holds for many pairs of bases, including for example, delta spikes and the 
sine waves of a Fourier basis, or the Fourier basis and wavelets. 
Significantly, this incoherence also holds with high probability between an 
arbitrary fixed basis and a randomly generated one. 


Methods for signal recovery 


Although the problem of recovering x from y is ill-posed in general 
(because x € RY, y € R™, and M < N), it is indeed possible to recover 
sparse signals from CS measurements. Given the measurements y = @z, 
there exist an infinite number of candidate signals in the shifted nullspace 
AN (&) + « that could generate the same measurements y (see Linear 
Models from Low-Dimensional Signal Models). Recovery of the correct 
signal x can be accomplished by seeking a sparse solution among these 
candidates. 


Recovery via combinatorial optimization 


Supposing that x is exactly K-sparse in the dictionary WY, then recovery of 
x from y can be formulated as the 2) minimization 
Equation: 


@ =argmin || a ||) s.t. y= Wa. 


Given some technical conditions on @ and W (see Theorem [link ]below), 
then with high probability this optimization problem returns the proper K - 
sparse solution a, from which the true z may be constructed. (Thanks to the 
incoherence between the two bases, if the original signal is sparse in the a 


coefficients, then no other set of sparse signal coefficients a’ can yield the 
Same projections y.) We note that the recovery program [link] can be 
interpreted as finding a K-term approximation to y from the columns of the 
dictionary YW, which we call the holographic basis because of the 
complex pattern in which it encodes the sparse signal coefficients [link]. 


In principle, remarkably few incoherent measurements are required to 
recover a K-sparse signal via £9) minimization. Clearly, more than K 
measurements must be taken to avoid ambiguity; the following theorem 
(which is proved in [link]) establishes that A + 1 random measurements 
will suffice. (Similar results were established by Venkataramani and 
Bresler [link].) 

Theorem 


Let W be an orthonormal basis for RY, and let 1 < K < N. Then the 
following statements hold: 


1. Let be an M x N measurement matrix with i.i.d. Gaussian entries 
with M > 2K. Then with probability one the following statement 
holds: all signals 2 = Wa having expansion coefficients a € R% that 
satisfy || @ ||) = K can be recovered uniquely from the M- 
dimensional measurement vector y = @z via the £9 optimization 
[link]. 

2. Let ¢ = Wa such that || a ||) = K. Let @ be an M x N measurement 
matrix with i.i.d. Gaussian entries (notably, independent of x) with 
M > K +1. Then with probability one the following statement holds: 
x can be recovered uniquely from the (/-dimensional measurement 
vector y = @z via the £9 optimization [link]. 

3. Let 2 be an M x N measurement matrix, where / < Kk. Then, aside 
from pathological cases (specified in the proof), no signal x = Wa 
with || @ ||, = can be uniquely recovered from the M/-dimensional 
measurement vector y = @z. 


The second statement of the theorem differs from the first in the following 
respect: when kK < M < 2K, there will necessarily exist K’-sparse signals 
x that cannot be uniquely recovered from the M/-dimensional measurement 
vector y = x. However, these signals form a set of measure zero within 


the set of all/-sparse signals and can safely be avoided if is randomly 
generated independently of z. 


Unfortunately, as discussed in Nonlinear Approximation from 
Approximation, solving this 29 optimization problem is prohibitively 
complex. Yet another challenge is robustness; in the setting of Theorem 
"Recovery via £ 0 optimization", the recovery may be very poorly 
conditioned. In fact, both of these considerations (computational 
complexity and robustness) can be addressed, but at the expense of slightly 
more measurements. 


Recovery via convex optimization 


The practical revelation that supports the new CS theory is that it is not 
necessary to solve the £9-minimization problem to recover a. In fact, a 
much easier problem yields an equivalent solution (thanks again to the 
incoherency of the bases); we need only solve for the £;-sparsest 
coefficients a that agree with the measurements y [link], [link], [link], 
[link], [link], [link], [Link], [link] 

Equation: 


@ =argmin ||a ||, s.t. y= Wa. 


optimization problem, also known as Basis Pursuit [link], is significantly 
more approachable and can be solved with traditional linear programming 
techniques whose computational complexities are polynomial in NV. 


There is no free lunch, however; according to the theory, more than K + 1 
measurements are required in order to recover sparse signals via Basis 
Pursuit. Instead, one typically requires IJ > ck measurements, where 

c > 1 is an oversampling factor. As an example, we quote a result 
asymptotic in NV. For simplicity, we assume that the sparsity scales linearly 
with N; that is, kK = SN, where we call S the sparsity rate. 

Theorem 


[link], [link], [link] Set kK = SN with0O < S < 1. Then there exists an 
oversampling factor c(.S) = O(log (1/9)), c(.S) > 1, such that, for a K- 
sparse signal x in the basis Y, the following statements hold: 


1. The probability of recovering x via Basis Pursuit from (c(.S') + €)K 
random projections, € > 0, converges to one as V — oo. 

2. The probability of recovering x via Basis Pursuit from (c(S') — €)K 
random projections, € > 0, converges to zero as N — oo. 


In an illuminating series of recent papers, Donoho and Tanner [link], [link], 
[link] have characterized the oversampling factor c(S) precisely (see also 
"The geometry of Compressed Sensing"). With appropriate oversampling, 
reconstruction via Basis Pursuit is also provably robust to measurement 
noise and quantization error [link]. 


We often use the abbreviated notation c to describe the oversampling factor 
required in various settings even though c(S) depends on the sparsity A 
and signal length NV. 


A CS recovery example on the Cameraman test image is shown in [Link]. In 
this case, with MZ = 4K we achieve near-perfect recovery of the sparse 
measured image. 
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Compressive sensing reconstruction of the 
nonlinear approximation Cameraman image from 
[link](b). Using IZ = 16384 random 
measurements of the K-term nonlinear 
approximation image (where K = 4096), we 
solve an £;-minimization problem to obtain the 
reconstruction shown above. The MSE with 
respect to the measured image is 0.08, so the 
reconstruction is virtually perfect. 


Recovery via greedy pursuit 


At the expense of slightly more measurements, iterative greedy algorithms 
such as Orthogonal Matching Pursuit (OMP) [link], Matching Pursuit (MP) 
[link], and Tree Matching Pursuit (TMP) [link], [link] have also been 
proposed to recover the signal x from the measurements y (see Nonlinear 


c © 2 In (N){link] to succeed with high probability. OMP is also 


guaranteed to converge within M iterations. We note that Tropp and Gilbert 
require the OMP algorithm to succeed in the first A iterations [link]; 
however, in our simulations, we allow the algorithm to run up to the 
maximum of M possible iterations. The choice of an appropriate practical 
stopping criterion (likely somewhere between K and MM iterations) is a 
subject of current research in the CS community. 


Impact and applications 


CS appears to be promising for a number of applications in signal 
acquisition and compression. Instead of sampling a K-sparse signal N 
times, only cK incoherent measurements suffice, where K can be orders of 
magnitude less than NV. Therefore, a sensor can transmit far fewer 
measurements to a receiver, which can reconstruct the signal and then 
process it in any manner. Moreover, the cK measurements need not be 
manipulated in any way before being transmitted, except possibly for some 
quantization. Finally, independent and identically distributed (i.i.d.) 
Gaussian or Bernoulli/Rademacher (random +1) vectors provide a useful 
universal basis that is incoherent with all others. Hence, when using a 
random basis, CS is universal in the sense that the sensor can apply the 
Same Measurement mechanism no matter what basis the signal is sparse in 
(and thus the coding algorithm is independent of the sparsity-inducing 
basis) [link], [Link], [link]. 


These features of CS make it particularly intriguing for applications in 
remote sensing environments that might involve low-cost battery operated 
wireless sensors, which have limited computational and communication 
capabilities. Indeed, in many such environments one may be interested in 
sensing a collection of signals using a network of low-cost signals. 


Other possible application areas of CS include imaging [link], medical 
imaging [link], [link], and RF environments (where high-bandwidth signals 
may contain low-dimensional structures such as radar chirps) [link]. As 
research continues into practical methods for signal recovery (see [link]), 
additional work has focused on developing physical devices for acquiring 
random projections. Our group has developed, for example, a prototype 
digital CS camera based on a digital micromirror design [link]. Additional 


work suggests that standard components such as filters (with randomized 
impulse responses) could be useful in CS hardware devices [link]. 


The geometry of Compressed Sensing 


It is important to note that the core theory of CS draws from a number of 
deep geometric arguments. For example, when viewed together, the CS 
encoding/decoding process can be interpreted as a linear projection 

® : RY ++ R™ followed by a nonlinear mapping A : R™ +> R*. Ina very 
general sense, one may naturally ask for a given class of signals FA € RY 
(such as the set of K-sparse signals or the set of signals with coefficients 

|| & ||, < 1), what encoder/decoder pair &, A will ensure the best 


reconstruction (minimax distortion) of all signals in Y. This best-case 
performance is proportional to what is known as the Gluskin n-width [link], 
[link] of FY (in our setting n = M), which in tum has a geometric 
interpretation. Roughly speaking, the Gluskin n-width seeks the (IV — n)- 
dimensional slice through -F¥ that yields signals of greatest energy. This n- 
width bounds the best-case performance of CS on classes of compressible 
signals, and one of the hallmarks of CS is that, given a sufficient number of 
measurements this optimal performance is achieved (to within a 

constant) [link], [link]. 


Additionally, one may view the £)/£, equivalence problem geometrically. 
In particular, given the measurements y = @z, we have an (NV — M)- 
dimensional hyperplane %, = {a' € RY : y= G2'} = VW ($) + cof 
feasible signals that could account for the measurements y. Supposing the 
original signal x is K-sparse, the 2; recovery program will recover the 
correct solution x if and only if || x’ ||1> || a ||, for every other signal 

zx’ € 4, 0n the hyperplane. This happens only if the hyperplane %, (which 
passes through «) does not “cut into” the ¢;-ball of radius || z ||,. This ¢1- 
ball is a polytope, on which z belongs to a (K — 1)-dimensional “face.” If 
@ is arandom matrix with i.i.d. Gaussian entries, then the hyperplane #;, 
will have random orientation. To answer the question of how M must relate 
to K in order to ensure reliable recovery, it helps to observe that a randomly 


generated hyperplane # will have greater chance to slice into the @; ball as 
dim(.#) = N — M grows (or as M shrinks) or as the dimension K — 1 


of the face on which z lives grows. Such geometric arguments have been 
made precise by Donoho and Tanner [link], [link], [link] and used to 
establish a series of sharp bounds on CS recovery. 


Connections with dimensionality reduction 


We have also identified [link] a fundamental connection between the CS 
and the JL lemma. In order to make this connection, we considered the 
Restricted Isometry Property (RIP), which has been identified as a key 
property of the CS projection operator @ to ensure stable signal recovery. 
We say & has RIP of order K if for every K-sparse signal z, 

Equation: 


M . |\€ell, M 
(1—e)4/— < < (1+ .6)4/—. 
N IEaIb 


N 


A random M x N matrix with i.i.d. Gaussian entries can be shown to have 
this property with high probability if 4 = O(K log (N/K)). 


While the JL lemma concerns pairwise distances within a finite cloud of 
points, the RIP concerns isometric embedding of an infinite number of 
points (comprising a union of K-dimensional subspaces in R). However, 
the RIP can in fact be derived by constructing an effective sampling of K- 
sparse signals in RY, using the JL lemma to ensure isometric embeddings 
for each of these points, and then arguing that the RIP must hold true for all 
K-sparse signals. (See [link] for the full details.) 


Stable embeddings of manifolds 


Finally, we have also shown that the JL lemma can also lead to extensions 
of CS to other concise signal models. In particular, while conventional CS 
theory concerns sparse signal models, it is also possible to consider 
manifold-based signal models. Just as random projections can preserve the 
low- dimensional geometry (the union of hyperplanes) that corresponds to a 
sparse signal family, random projections can also guarantee a stable 


embedding of a low-dimensional signal manifold. We have the following 
result, which states that an RIP-like property holds for families of manifold- 
modeled signals. 

Theorem 


Let .@ be a compact K-dimensional Riemannian submanifold of RY 
having condition number - , volume V, and geodesic covering regularity 
R. Fix 0 < € < landO < p < 1. Let ® bea random M * N orthoprojector 
with 

Equation: 


Klog (NVRrte) log ( ) 
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If M < N, then with probability at least 1 — p the following statement 
holds: For every pair of points 71, x2€.@, 
Equation: 


M _ ||€x1 — $25]\, M 
(.-.)¥ ee ae (1+) % 


The proof of this theorem appears in [link] and again involves the JL 
lemma. Due to the limited complexity of a manifold model, it is possible to 
adequately characterize the geometry using a sufficiently fine sampling of 
points drawn from the manifold and its tangent spaces. In essence, 
manifolds with higher volume or with greater curvature have more 
complexity and require a more dense covering for application of the JL 
lemma; this leads to an increased number of measurements. The theorem 
also indicates that the requisite number of measurements depends on the 
geodesic covering regularity of the manifold, a minor technical concept 
which is also discussed in [link]. 


This theorem establishes that, like the class of K-sparse signals, a 
collection of signals described by a K-dimensional manifold .@CR* can 
have a stable embedding in an M/-dimensional measurement space. 
Moreover, the requisite number of random measurements / is once again 
linearly proportional to the information level (or number of degrees of 
freedom) K in the concise model. This has a number of possible 
implications for manifold-based signal processing. Manifold-modeled 
signals can be recovered from compressive measurements (using a 
customized recovery algorithm adapted to the manifold model, in contrast 
with sparsity-based recovery algorithms) [link], [link]; unknown parameters 
in parametric models can be estimated from compressive measurements; 
multi-class estimation/classification problems can be addressed [link] by 
considering multiple manifold models; and manifold learning algorithms 
may be efficiently executed by applying them simply to the projection of a 
manifold-modeled data set to a low-dimensional measurement space [link]. 
(As an example, [link](d) shows the result of applying the ISOMAP 
algorithm on a random projection of a data set from IR*°%6 down to R?°; the 
underlying parameterization of the manifold is extracted with little sacrifice 
in accuracy.) In all of this it is not necessary to adapt the sensing protocol to 
the model; the only change from sparsity-based CS would be the methods 
for processing or decoding the measurements. In the future, more 
sophisticated concise models will likely lead to further improvements in 
signal understanding from compressive measurements. 


